The main goal of this analysis was to apply the most common algorithm for analysing what people purchase: association rule mining with the Apriori algorithm.
The dataset describes the items people buy when they go to a shop.
The data comes from the Kaggle platform: https://www.kaggle.com/gorkhachatryan01/purchase-behaviour
We then check the reproducibility of the methods on a different dataset: https://www.kaggle.com/roshansharma/market-basket-optimization/version/1
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
setwd("C:/Users/wangz/Desktop")
md = read.transactions("dataset.csv",format = "basket",
sep = ",",skip = 0, header = TRUE)
dim(md)
## [1] 1498 38
#average number of items
ave_size = mean(size(md));
ave_size
## [1] 10.34913
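As a sanity check on what `read.transactions(format = "basket")` does, here is a minimal Python sketch on toy data (not the real file): each line is one transaction, items are split on the `sep` character, and the average basket size is the mean number of items per line, the analogue of `mean(size(md))` above.

```python
# Toy analogue of read.transactions(format = "basket"): each line is one
# transaction, items separated by commas (toy data, not the real dataset).
lines = [
    "vegetables,poultry,waffles",
    "vegetables,bagels",
    "lunch meat,vegetables,poultry,eggs",
]

# Parse each line into a set of items.
transactions = [set(line.split(",")) for line in lines]

# Average number of items per transaction (analogue of mean(size(md))).
ave_size = sum(len(t) for t in transactions) / len(transactions)
print(ave_size)  # 3.0
```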
summary(md)
## transactions as itemMatrix in sparse format with
## 1498 rows (elements/itemsets/transactions) and
## 38 columns (items) and a density of 0.2723456
##
## most frequent items:
## vegetables poultry waffles bagels lunch meat (Other)
## 894 431 418 417 413 12930
##
## element (itemset/transaction) length distribution:
## sizes
## 3 4 5 6 7 8 9 10 11 12 13 14
## 8 57 51 51 71 74 95 191 304 320 212 64
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 11.00 10.35 12.00 14.00
##
## includes extended item information - examples:
## labels
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
# relative frequency
round(itemFrequency(md, type="relative"),4)
## all- purpose aluminum foil
## 0.2630 0.2637
## bagels beef
## 0.2784 0.2623
## butter cereals
## 0.2610 0.2737
## cheeses coffee/tea
## 0.2603 0.2630
## dinner rolls dishwashing liquid/detergent
## 0.2583 0.2684
## eggs flour
## 0.2690 0.2570
## fruits hand soap
## 0.2637 0.2377
## ice cream individual meals
## 0.2750 0.2717
## juice ketchup
## 0.2577 0.2503
## laundry detergent lunch meat
## 0.2644 0.2757
## milk mixes
## 0.2710 0.2737
## paper towels pasta
## 0.2550 0.2717
## pork poultry
## 0.2497 0.2877
## sandwich bags sandwich loaves
## 0.2497 0.2490
## shampoo soap
## 0.2477 0.2657
## soda spaghetti sauce
## 0.2737 0.2543
## sugar toilet paper
## 0.2670 0.2704
## tortillas vegetables
## 0.2443 0.5968
## waffles yogurt
## 0.2790 0.2684
# plot for relative frequency
itemFrequencyPlot(
md,
topN = 10,
type = "relative",
main = "Item frequency",
cex.names = 0.85
)
#absolute frequency
itemFrequency(md, type="absolute")
## all- purpose aluminum foil
## 394 395
## bagels beef
## 417 393
## butter cereals
## 391 410
## cheeses coffee/tea
## 390 394
## dinner rolls dishwashing liquid/detergent
## 387 402
## eggs flour
## 403 385
## fruits hand soap
## 395 356
## ice cream individual meals
## 412 407
## juice ketchup
## 386 375
## laundry detergent lunch meat
## 396 413
## milk mixes
## 406 410
## paper towels pasta
## 382 407
## pork poultry
## 374 431
## sandwich bags sandwich loaves
## 374 373
## shampoo soap
## 371 398
## soda spaghetti sauce
## 410 381
## sugar toilet paper
## 400 405
## tortillas vegetables
## 366 894
## waffles yogurt
## 418 402
#plot for absolute frequency
itemFrequencyPlot(
md,
topN = 10,
type = "absolute",
main = "Item frequency",
cex.names = 0.85
)
The figure above shows the 10 most frequently purchased items. Vegetables rank first, followed by poultry and waffles.
#Plot for min support
itemFrequencyPlot(md, support = 0.1) #minimum support at 10%
I use the Apriori algorithm. To simplify the analysis, I set confidence = 0.4 and support = 0.1. With these values the algorithm found 38 rules.
rules = apriori(md, parameter = list(supp = 0.1, conf = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 149
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1498 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [38 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
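The lines "Absolute minimum support count: 149" and "checking subsets of size 1 2" in the output above reflect how Apriori works: it proceeds level by level and only extends itemsets whose subsets are already frequent (the downward-closure property). A minimal Python sketch of that idea on toy data (this is an illustration of the principle, not the arules implementation):

```python
from itertools import combinations

# Toy transactions for illustration (not the real dataset).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
min_count = 2  # absolute minimum support count

def frequent_itemsets(transactions, min_count):
    """Level-wise Apriori: grow candidates from frequent k-itemsets."""
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_count]
    frequent = list(current)
    while current:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: keep candidates contained in >= min_count baskets.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_count]
        frequent.extend(current)
    return frequent

fs = frequent_itemsets(transactions, min_count)
```

Here {a, b, c} appears in only one basket, so it is pruned even though all of its two-item subsets are frequent.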
Support is a measure of how often a certain subset of items appears in the whole dataset.
rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {} => {vegetables} 0.5967957 0.5967957 1.0000000 1.000000
## [2] {yogurt} => {vegetables} 0.1762350 0.6567164 0.2683578 1.100404
## [3] {poultry} => {vegetables} 0.1748999 0.6078886 0.2877170 1.018587
## [4] {laundry detergent} => {vegetables} 0.1728972 0.6540404 0.2643525 1.095920
## [5] {lunch meat} => {vegetables} 0.1715621 0.6222760 0.2757009 1.042695
## [6] {cereals} => {vegetables} 0.1702270 0.6219512 0.2736983 1.042151
## count
## [1] 894
## [2] 264
## [3] 262
## [4] 259
## [5] 257
## [6] 255
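Support is simply the rule's count divided by the number of transactions; for instance, 264/1498 gives the 0.1762350 reported for {yogurt} => {vegetables}. A small Python sketch on toy data:

```python
# Support of an itemset: the share of transactions that contain it.
# Toy transactions for illustration (not the real dataset).
transactions = [
    {"yogurt", "vegetables"},
    {"yogurt", "vegetables", "milk"},
    {"vegetables"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

s = support({"yogurt", "vegetables"}, transactions)
print(s)  # 0.5
```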
Confidence is a measure of how likely the consumer is to buy product Y (rhs) given that they already have product(s) X (lhs) in their basket.
rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {yogurt} => {vegetables} 0.1762350 0.6567164 0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404 0.2643525 1.095920
## [3] {eggs} => {vegetables} 0.1695594 0.6302730 0.2690254 1.056095
## [4] {lunch meat} => {vegetables} 0.1715621 0.6222760 0.2757009 1.042695
## [5] {cereals} => {vegetables} 0.1702270 0.6219512 0.2736983 1.042151
## [6] {flour} => {vegetables} 0.1595461 0.6207792 0.2570093 1.040187
## count
## [1] 264
## [2] 259
## [3] 254
## [4] 257
## [5] 255
## [6] 239
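Confidence is the support of the whole rule divided by the support of the lhs, equivalently the rule count divided by the lhs count; for {yogurt} => {vegetables}, 264/402 gives the reported 0.6567164 (402 is yogurt's absolute frequency above). A toy Python sketch:

```python
# Confidence of lhs => rhs: support(lhs and rhs) / support(lhs).
# Toy transactions for illustration (not the real dataset).
transactions = [
    {"yogurt", "vegetables"},
    {"yogurt", "vegetables", "milk"},
    {"yogurt"},
    {"vegetables"},
]

def confidence(lhs, rhs, transactions):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    both = sum((lhs | rhs) <= t for t in transactions)
    lhs_count = sum(lhs <= t for t in transactions)
    return both / lhs_count

c = confidence({"yogurt"}, {"vegetables"}, transactions)  # 2 of 3 yogurt baskets
```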
Lift can be understood as a measure of correlation between the lhs and the rhs: it compares how often they co-occur with how often they would co-occur if the purchases were independent.
rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {yogurt} => {vegetables} 0.1762350 0.6567164 0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404 0.2643525 1.095920
## [3] {eggs} => {vegetables} 0.1695594 0.6302730 0.2690254 1.056095
## [4] {lunch meat} => {vegetables} 0.1715621 0.6222760 0.2757009 1.042695
## [5] {cereals} => {vegetables} 0.1702270 0.6219512 0.2736983 1.042151
## [6] {flour} => {vegetables} 0.1595461 0.6207792 0.2570093 1.040187
## count
## [1] 264
## [2] 259
## [3] 254
## [4] 257
## [5] 255
## [6] 239
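Lift equals the rule's confidence divided by the support of the rhs; for {yogurt} => {vegetables}, 0.6567164 / 0.5967957 gives the reported 1.100404. A toy Python sketch:

```python
# Lift of lhs => rhs: how much more often lhs and rhs co-occur than
# expected under independence. Toy data for illustration.
transactions = [
    {"yogurt", "vegetables"},
    {"yogurt", "vegetables"},
    {"yogurt"},
    {"milk"},
]

def lift(lhs, rhs, transactions):
    """support(lhs and rhs) / (support(lhs) * support(rhs))."""
    n = len(transactions)
    supp_both = sum((lhs | rhs) <= t for t in transactions) / n
    supp_lhs = sum(lhs <= t for t in transactions) / n
    supp_rhs = sum(rhs <= t for t in transactions) / n
    return supp_both / (supp_lhs * supp_rhs)

l = lift({"yogurt"}, {"vegetables"}, transactions)  # > 1: positive association
```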
We can see in the results that all lift values are higher than 1, so the rhs products are more likely to be bought together with the lhs products than they would be if the purchases were independent.
plot(rules, engine="plotly")
In our data vegetables is by far the most frequent product, so almost every rule points to it and we cannot observe any other patterns. Let us therefore use another product as the rhs: I will take ice cream.
rules_ice_cream = apriori(
data = md,
parameter = list(supp = 0.01, conf = 0.4),
appearance = list(default = "lhs", rhs = "ice cream"),
control = list(verbose = F)
)
rules_ice_cream_table = inspect(rules_ice_cream, linebreak = FALSE)
## lhs rhs
## [1] {hand soap,spaghetti sauce,vegetables} => {ice cream}
## [2] {cereals,paper towels,sandwich loaves} => {ice cream}
## [3] {all- purpose,lunch meat,spaghetti sauce} => {ice cream}
## [4] {aluminum foil,pasta,spaghetti sauce} => {ice cream}
## [5] {dishwashing liquid/detergent,flour,paper towels} => {ice cream}
## [6] {aluminum foil,paper towels,soda} => {ice cream}
## [7] {aluminum foil,coffee/tea,soda} => {ice cream}
## [8] {aluminum foil,juice,milk} => {ice cream}
## [9] {aluminum foil,beef,yogurt} => {ice cream}
## [10] {aluminum foil,beef,vegetables} => {ice cream}
## [11] {aluminum foil,milk,toilet paper} => {ice cream}
## support confidence coverage lift count
## [1] 0.01001335 0.4054054 0.02469960 1.474023 15
## [2] 0.01001335 0.4838710 0.02069426 1.759317 15
## [3] 0.01001335 0.4054054 0.02469960 1.474023 15
## [4] 0.01001335 0.5000000 0.02002670 1.817961 15
## [5] 0.01001335 0.5000000 0.02002670 1.817961 15
## [6] 0.01001335 0.4838710 0.02069426 1.759317 15
## [7] 0.01134846 0.4594595 0.02469960 1.670559 17
## [8] 0.01001335 0.5000000 0.02002670 1.817961 15
## [9] 0.01001335 0.4545455 0.02202937 1.652692 15
## [10] 0.01802403 0.4576271 0.03938585 1.663897 27
## [11] 0.01134846 0.4358974 0.02603471 1.584889 17
Because transactions of this type are less frequent, I reduced the minimum support to 0.01.
Due to the small sample, there is no clear pattern in the results of this analysis.
plot(rules_ice_cream, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_ice_cream, method="graph")
In this paper I mainly used the Apriori method for mining association rules. Although the results are not very strong, I think association rules are an interesting method of data analysis.
library(kableExtra)
library(arules)
library(arulesViz)
transactions = read.transactions(
"Market_Basket_Optimisation.csv",
format = "basket",
sep = ",",
skip = 0,
header = TRUE
)
transactions
## transactions in sparse format with
## 7500 transactions (rows) and
## 119 items (columns)
itemFrequencyPlot(
transactions,
topN = 20,
type = "absolute",
main = "Item frequency",
cex.names = 0.85
)
rules = apriori(transactions, parameter = list(supp = 0.01, conf = 0.40))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
## lhs rhs support confidence
## [1] {ground beef} => {mineral water} 0.04093333 0.4165536
## [2] {olive oil} => {mineral water} 0.02746667 0.4178499
## [3] {soup} => {mineral water} 0.02306667 0.4564644
## [4] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [5] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [6] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## coverage lift count
## [1] 0.09826667 1.748266 307
## [2] 0.06573333 1.753707 206
## [3] 0.05053333 1.915771 173
## [4] 0.03920000 1.827256 128
## [5] 0.04093333 2.394361 128
## [6] 0.03920000 1.698777 119
rules_supp_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {ground beef} | => | {mineral water} | 0.0409333 | 0.4165536 | 0.0982667 | 1.748266 | 307 |
| [2] | {olive oil} | => | {mineral water} | 0.0274667 | 0.4178499 | 0.0657333 | 1.753707 | 206 |
| [3] | {soup} | => | {mineral water} | 0.0230667 | 0.4564644 | 0.0505333 | 1.915771 | 173 |
| [4] | {ground beef,spaghetti} | => | {mineral water} | 0.0170667 | 0.4353741 | 0.0392000 | 1.827256 | 128 |
| [5] | {ground beef,mineral water} | => | {spaghetti} | 0.0170667 | 0.4169381 | 0.0409333 | 2.394361 | 128 |
| [6] | {chocolate,spaghetti} | => | {mineral water} | 0.0158667 | 0.4047619 | 0.0392000 | 1.698777 | 119 |
rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [5] {soup} => {mineral water} 0.02306667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
rules_conf_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {eggs,ground beef} | => | {mineral water} | 0.0101333 | 0.5066667 | 0.0200000 | 2.126469 | 76 |
| [2] | {ground beef,milk} | => | {mineral water} | 0.0110667 | 0.5030303 | 0.0220000 | 2.111207 | 83 |
| [3] | {chocolate,ground beef} | => | {mineral water} | 0.0109333 | 0.4739884 | 0.0230667 | 1.989319 | 82 |
| [4] | {frozen vegetables,milk} | => | {mineral water} | 0.0110667 | 0.4689266 | 0.0236000 | 1.968074 | 83 |
| [5] | {soup} | => | {mineral water} | 0.0230667 | 0.4564644 | 0.0505333 | 1.915771 | 173 |
| [6] | {pancakes,spaghetti} | => | {mineral water} | 0.0114667 | 0.4550265 | 0.0252000 | 1.909736 | 86 |
rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
## lhs rhs support confidence
## [1] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [2] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [3] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [4] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [5] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [6] {soup} => {mineral water} 0.02306667 0.4564644
## coverage lift count
## [1] 0.04093333 2.394361 128
## [2] 0.02000000 2.126469 76
## [3] 0.02200000 2.111207 83
## [4] 0.02306667 1.989319 82
## [5] 0.02360000 1.968075 83
## [6] 0.05053333 1.915771 173
rules_lift_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {ground beef,mineral water} | => | {spaghetti} | 0.0170667 | 0.4169381 | 0.0409333 | 2.394361 | 128 |
| [2] | {eggs,ground beef} | => | {mineral water} | 0.0101333 | 0.5066667 | 0.0200000 | 2.126469 | 76 |
| [3] | {ground beef,milk} | => | {mineral water} | 0.0110667 | 0.5030303 | 0.0220000 | 2.111207 | 83 |
| [4] | {chocolate,ground beef} | => | {mineral water} | 0.0109333 | 0.4739884 | 0.0230667 | 1.989319 | 82 |
| [5] | {frozen vegetables,milk} | => | {mineral water} | 0.0110667 | 0.4689266 | 0.0236000 | 1.968074 | 83 |
| [6] | {soup} | => | {mineral water} | 0.0230667 | 0.4564644 | 0.0505333 | 1.915771 | 173 |
plot(rules, engine="plotly")
rules_chocolate = apriori(
data = transactions,
parameter = list(supp = 0.001, conf = 0.7),
appearance = list(default = "lhs", rhs = "chocolate"),
control = list(verbose = F)
)
rules_chocolate_table = inspect(rules_chocolate, linebreak = FALSE)
## lhs rhs
## [1] {red wine,tomato sauce} => {chocolate}
## [2] {almonds,olive oil,spaghetti} => {chocolate}
## [3] {almonds,milk,spaghetti} => {chocolate}
## [4] {escalope,french fries,shrimp} => {chocolate}
## [5] {burgers,olive oil,pancakes} => {chocolate}
## [6] {frozen vegetables,mineral water,pancakes,shrimp} => {chocolate}
## support confidence coverage lift count
## [1] 0.001066667 0.8000000 0.001333333 4.882018 8
## [2] 0.001066667 0.7272727 0.001466667 4.438198 8
## [3] 0.001066667 0.7272727 0.001466667 4.438198 8
## [4] 0.001066667 0.8888889 0.001200000 5.424464 8
## [5] 0.001200000 0.7500000 0.001600000 4.576892 9
## [6] 0.001066667 0.7272727 0.001466667 4.438198 8
rules_chocolate_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {red wine,tomato sauce} | => | {chocolate} | 0.0010667 | 0.8000000 | 0.0013333 | 4.882018 | 8 |
| [2] | {almonds,olive oil,spaghetti} | => | {chocolate} | 0.0010667 | 0.7272727 | 0.0014667 | 4.438198 | 8 |
| [3] | {almonds,milk,spaghetti} | => | {chocolate} | 0.0010667 | 0.7272727 | 0.0014667 | 4.438198 | 8 |
| [4] | {escalope,french fries,shrimp} | => | {chocolate} | 0.0010667 | 0.8888889 | 0.0012000 | 5.424464 | 8 |
| [5] | {burgers,olive oil,pancakes} | => | {chocolate} | 0.0012000 | 0.7500000 | 0.0016000 | 4.576892 | 9 |
| [6] | {frozen vegetables,mineral water,pancakes,shrimp} | => | {chocolate} | 0.0010667 | 0.7272727 | 0.0014667 | 4.438198 | 8 |
plot(rules_chocolate, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_chocolate, method="graph")
From this project we can see that association rules are an interesting method of data analysis that can relatively easily uncover many relationships between purchased items. I also practised reproducible research by applying the same methods to another dataset, which demonstrates the reproducibility of my code.